On fitting power laws to ecological data 
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Heavy-tailed or power-law distributions are becoming increasingly common in biological literature. 
A wide range of biological data has been fitted to distributions with heavy tails. Many of these 
studies use simple fitting methods to find the parameters in the distribution, which can give highly 
misleading results. The potential pitfalls that can occur when using these methods are pointed 
out, and a step-by-step guide to fitting power-law distributions and assessing their goodness-of-fit 
is offered. 
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I. INTRODUCTION 



In the last fifteen years, power-law distributions, with 
their heavy-tailed and scale-invariant features, have be- 
come ubiquitous in scientific literature in general [l~7l . |22| . 
Heavy-tailed distributions have been fitted to data from 
a wide range of sources, including ecological size spectra 
[25l ] . dispersal functions for spores [H|, seeds [l3| and 
birds [l8|, and animal foraging movements In the 
latter case, the fit of a heavy-tailed distribution has been 
used as evidence that the optimal foraging strategy is a 
Levy walk with exponent /i m 2 prj , However, the 
appropriateness of heavy-tailed distributions for some of 
these data sets, and the methods used to fit them, have 
recently been questioned by Edwards et al [111 ]. 

In a literature search of studies on foraging movements, 
ecological size spectra and dispersal functions, the first 24 
papers that fitted power-law distributions to data were 
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271 . |28| . In only six of these 24 
papers was it clear that the authors correctly fitted a sta- 
tistical distribution to the data, whereas 15 used a flawed 
fitting procedure (and in the remaining three it was un- 
clear). Ten of the studies compared the goodness-of-fit 
of the power-law distribution to other types of distribu- 
tion, but at least two of these ten used comparison meth- 
ods that were invalid in this context. Correctly fitting a 
power-law distribution to data is not difficult but requires 
a little more statistical knowledge than the standard lin- 
ear regression and linear correlation techniques used in 
most cases. 

In this report, the most common methods currently 
used to fit a power law to data are reviewed, and some of 
the drawbacks of these methods are discussed. A method 
that avoids these drawbacks, based on maximum likeli- 
hood estimation, is then described in a step-by-step guide 
(the reader is also referred to [LH). Two case studies are 
presented illustrating the advantages of this method. 



linear regression; maximum likelihood; statistical 



II. METHODS 

Linear regression methods 

The probability density function (PDF) for a power- 
law distribution has the form 



p(x) 



where fi > 1. 



This function is not defined on the range to 00, but 
on the range x m i n to infinity, otherwise the distribution 
cannot be normalised to sum to 1. This condition implies 
that the exponent fi is related to a; m i n and c by 



c = (/x — 1) x 



min 



A common problem is to determine whether a partic- 
ular sample of data is drawn from a power-law distribu- 
tion and, if so, to estimate the value of the exponent fi. A 
standard approach to this problem (used by 14 of the pa- 
pers in the literature sample) is to bin the data and plot 
a frequency histogram on a log-log scale. If the sample 
is indeed drawn from a power-law distribution and the 
correct binning strategy is used, there is a linear rela- 
tionship between In a; and lap(x), so the frequency plot 
will produce a straight line of slope — /1 22]. Alterna- 
tively, one may use the cumulative distribution function 
(CDF) C(x), which shows a linear relationship between 
In a; and In (1 — C(x)). So plotting 1 — C(x) (i.e. the rel- 
ative frequency of observations with a value greater than 
x) against x (called a rank-frequency (RF) plot) on a 
log-log scale will give a straight line of slope 1 — \i. 

All data sets are subject to statistical noise, particu- 
larly in the tail of the distribution. Hence a frequency 
chart will never produce a perfectly straight line, so some 
kind of fitting procedure is necessary. The most common 
method, used by 16 of the 24 studies surveyed, is to esti- 
mate the power-law exponent by using linear regression 
to find the line of best fit on the frequency histogram or 
RF plot. 



Drawbacks of the linear regression method 
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Linear regression assumes that one is free to vary both 
the slope and intercept of the line in order to obtain 



2 



the best possible fit. However, this is not true in the 
case of fitting a probability distribution, because of the 
constraint that the distribution must sum to 1. Once the 
range of the data has determined x m i n , the power-law 
distribution has only one degree of freedom /i, as opposed 
to the two afforded by the naive linear regression or line 
of best fit approach. Unless this constraint is explicitly 
acknowledged and respected, the fitting procedure will 
produce an incorrect estimate of the exponent /i and the 
fitted distribution will not be a PDF on the appropriate 
x range (see for example Case Study 1). 

Furthermore, the linear regression method does not of- 
fer a natural way to estimate the size of the error in the 
fitted value of /j, (very few of the studies surveyed made 
any attempt to do this). A point value for \x on its own 
is of questionable merit. 

A third criticism of the linear regression method of fit- 
ting a power law to some sample is that very rarely is any 
meaningful attempt made to judge the goodness-of-fit of 
the proposed model. The most common approach, taken 
by twelve of the surveyed works, is to provide the corre- 
lation coefficient r 2 . This is a measure of the strength of 
linear correlation between In a; and hxp(x), but is not a 
measure of the goodness-of-fit of a proposed model such 
as a power law. Of course, in the absence of any a priori 
knowledge of the distributions of errors in the data, mea- 
suring goodness-of-fit of a single model is extremely dif- 
ficult. However, it is possible to compare goodness-of-fit 
of two or more candidate models. In the work surveyed, 
nine papers explicitly compared the goodness-of-fit of the 
power-law distribution with other candidate models, but 
at least two of these nine used a comparison based on the 
flawed linear regression method and on r 2 , so in reality 
added little to the analysis. 

Finally, although it is not necessary to bin the data 
to use a linear regression method, 14 of the 24 surveyed 
papers did so in their analyses, and nine used bins of 
equal width. It should be noted that, when a sample 
that does follow a power-law distribution is binned using 
fixed width bins, the log-log plot of the PDF is not lin- 
ear. To see a linear relationship one must use logarithmic 
width bins. This fitting approach is described in detail 
in [22l |. but the statistical results are often sensitive to 
the binning strategy (in particular the logarithmic base) 
used [25[ and the occurrence of empty bins is problem- 
atic. Case Study 1 illustrates some of the nonsensical 
conclusions that can result from binning the data. By 
far the best approach to fitting a power law is not to bin 
the data in any way, but to use the raw data whenever 
possible. 



The maximum likelihood method 

A simple technique for fitting a power-law distribution 
is to use maximum likelihood estimation (MLE). This 
method provides an unbiased estimate of the exponent /i 
[l7| . Equally importantly, it provides an estimate of the 



error in the fitted value of (j,, and allows the goodness- 
of-fit of different candidate models to be directly com- 
pared ||. Furthermore, the method does not require 
the data to be binned in any way. In the step-by-step 
guide below, the method is used to compare the fit of a 
power-law distribution and an exponential distribution, 
p{x) = Ae~ A ( :E ~ :Emin ), on the same range (x m i n ,oo), but 
can be easily adapted to compare with other candidate 
distributions (e.g. normal distribution, gamma distribu- 
tion). 

The method proceeds in the following steps. 

1. Draw a RF plot. To gain a picture of the distri- 
bution of the data, plot the fraction of observations 
greater than x against x on a log-log plot by: 

• ordering the data from largest to smallest so 
that x\ > X2 > ■ ■ ■ > xn] 

• plotting ln(^) against hxxj (for j = 
1,...,N). 

2. Choose x m ; n . In some cases, it may be desirable 
to examine only the 'tail' of the distribution by first 
discarding from the sample all values less than some 
cut-off x m i n . Typically, this is chosen so as to dis- 
regard the curved part of the RF plot on its left- 
hand side. If it is required to fit a distribution to 
the entire sample, take x m i n to be the smallest ob- 
servation. 

3. Calculate MLE parameters and likelihoods: 

• MLE power-law exponent 

Mmle = 1 + M [J2 In — J , 

\i=i Xmin J 

• log-likelihood for power-law model 

£pow = M (ln(^MLE - 1) - lnXmm) 
M 

—Mmle ^ ha — — , 

i— 1 

• MLE exponential parameter 

/ m y 1 
Amle = M }Xxj - x miu ) 

• log-likelihood for exponential model 

M 

£cx P = M In Amle — Amle /A^i — x min), 

»=i 

where M is the number of data points in the (trun- 
cated) sample. See [l?} for a derivation of these 
formulae. 
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4. Select the best model. For each model, calculate 
the Akaike information criterion (AIC) and Akaike 
weight (w), which are defined, for model i, by 

AIC t = -2Ci + 2Ki 

yj ■ == 1 

• E ^ 1 exp(- AIC ^ 2 AIC - i " ) 

where Ki is the number of parameters in model i 
(K = 1 for the power-law and exponential models), 
p is the total number of models being compared 
(in this case, comparing a power-law model and an 
exponential model means that p = 2) and AIC m i n 
is the smallest of the AIC across all p models. 

The Akaike weight gives a measure of the likelihood 
that a particular model provides the best represen- 
tation of the data. If w p0 w > w cxp then a power- 
law distribution provides a better fit to the data 
than an exponential distribution, and the estimated 
value of the exponent of the power law is umle- If 
M) pow < w cxp then an exponential distribution is 
favoured over a power law on that range. 

5. Calculate the error in the estimate of [i. The 

standard deviation a of the estimated power-law 
exponent ^mle may also be calculated: 

This gives the average size of the error in the esti- 
mated value of fi. 

Six of the 24 surveyed studies correctly fitted the data 
and compared different candidate models using a method 
similar to the one described above. (It is interesting to 
note that these six were all found in the dispersal litera- 
ture.) 

III. RESULTS 

Case study 1 

Austin et al [l[ fitted power-law distributions to the 
movement lengths of grey seals; Figure [1] shows an ex- 
ample data set consisting of 96 observations. The origi- 
nal data are already binned, so it was necessary, for the 
re-analysis, to assume that all observations in the sam- 
ple were at the midpoint of their respective bin. Us- 
ing linear regression, the line of best fit has equation 
Y = —1.26 A — 0.474 (in log coordinates), leading to an 
estimated value of the exponent of p = 1.26. However, it 
was omitted from [l[ that, in order for the distribution 
to sum to 1, this line of best fit implies an x m i n value of 
31 km. As the data are in the range 2 km to 15 km, it is 
nonsensical to fit a distribution with range 31 < x < oo. 
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FIG. 1: Relative frequency histogram of seal movement 
lengths x (in km) on a log-log scale, together with the line of 
best fit ln(p(x)) = -1.2573 ln(x) - 0.4741. This line is not a 
probability distribution on an appropriate x range. 

These data were re-analysed in 22]. This work used 
logarithmic binning but, again, used linear regression to 
find the line of best fit and estimated the exponent to be 
/.t = 0.8. This exponent cannot be correct as any power 
law with an exponent of less than 1 cannot be normalised 
to sum to 1 and hence is not a probability distribution. 
This illustrates the major drawbacks of using linear re- 
gression on binned data: the fitted line may be sensitive 
to the size and number of bins used and to the occurrence 
of empty bins, and may result in a power law that is not 
an appropriate probability distribution. 

Using the maximum likelihood method outlined above, 
the estimated exponent is /xmle = 2.05 (with an expected 
error of a = 0.108). This explicitly presumes a minimum 
data value of a; m i n = 2 km (the smallest observation in 
the sample). However, comparing to an exponential dis- 
tribution as described above shows that the exponential 
distribution (with A = 0.246) provides a much better 
model for this data set than a power law (u> pc ,w < 10 -7 ). 
Hence, the power-law hypothesis for this data set can be 
confidently rejected. 

Case study 2 

The data set shown in Appendix [5] contains the land 
area (in hectares) of 789 farms in the Hurunui district 
of New Zealand; this data set was chosen for illustra- 
tive purposes. The RF plot for this data set is shown 
in Figure [2ja). We find the maximum likelihood esti- 
mate of the power-law exponent for a range of values of 
the minimum cut-off x m m (discarding all data less than 
x m in) using the method described above. By calculating 
the Akaike weights, it is possible to determine whether an 
exponential or a power-law distribution provides the bet- 
ter fit to the data, for any given value of x m - ln . A power 
law is favoured over an exponential (w pow > w cxp ) for all 
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FIG. 2: (a) RF plot for the farm size data set. (b) Graph 
of the estimated exponent /xmle against the cut-off x m i n . A 
power law provides a better fit to the data than an exponen- 
tial distribution for value of x mln between 250 and 1810; an 
exponential provides a better fit for values of x m in outside this 
range. 



choices of a; m ; n between 250 and 1810, but an exponential 
is favoured (w p0 w < w cxp ) for x mm < 250. Figure HJb) 
shows the estimated power-law exponent /imle for the 
different a; m ; n values: it is clear that the estimated expo- 
nent has a wide range of values from 2 to 3.3, depending 
on the range of data used. It should also be noted that, 
if the aim is to find the best statistical model describing 
the entire data set (i.e. x m { n equal to the smallest obser- 
vation in the sample), then an exponential distribution 
(with A = 2.33 x 10~ 3 ) is clearly favoured over a power 
law (w pow < 10~ 65 ). The dependence of the outcome 
of the analysis on the arbitrary selection of x m i n further 
illustrates the potential problems with fitting power-law 
distributions. 



IV. DISCUSSION 

Some of the statistical procedures commonly used to 
fit power-law distributions to ecological data often lead 
to misleading or incomplete conclusions pd| . In this pa- 
per, a simple step-by-step method for fitting a power law 
to a data set has been described. This method, based 
on maximum likelihood estimation, provides an unbiased 
estimate of the power-law exponent, as well as the ex- 
pected error in this value, and assesses the goodness-of-fit 
of a power-law model compared to alternative candidate 
models such as the exponential distribution. 
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APPENDIX A: FARM SIZE DATA 

The following data set (courtesy of Environment Can- 
terbury) contains the sizes (in hectares) of 789 farms in 
the Hurunui district of New Zealand. 

0.06, 0.08, 0.08, 0.08, 0.1, 0.1, 0.1, 0.1, 0.1, 0.3, 0.5, 0.5, 0.53 
0.6, 0.9, 1, 1, 1, 1, 1, 1, 1, 1.3, 1.5, 1.9, 1.9, 2, 2, 2, 2, 2, 2, 2, 2, 2 
2, 2.3, 2.9, 2.9, 2.9, 3, 3, 3, 3, 3, 3, 3, 3, 3.2, 3.3, 3.5, 3.8, 3.9, 4, 4 
4, 4, 4, 4, 4, 4, 4, 4.2, 4.2, 4.3, 4.6, 4.8, 4.9, 5, 5, 5, 5, 5, 5.4, 5.7 
6, 6, 6, 6, 6, 6, 6.2, 6.3, 6.3, 7, 7, 7, 7, 7, 7.2, 7.5, 7.9, 8, 8, 8, 8, 8 
8, 8, 8.3, 8.8, 9, 9, 9, 9, 9.3, 9.9, 10, 10, 10, 10, 10, 10.2, 10.4, 11 
11, 11, 11, 11, 11.8, 12, 12, 12, 12, 12, 12, 12, 12, 12.1, 13, 13, 1 
13, 13.8, 13.9, 14, 14, 14, 14, 14.4, 14.5, 15, 15, 15, 15, 15, 15, 15.2 
15.7, 15.8, 16, 16, 16, 16, 16, 17, 17, 17, 17, 17, 17.1, 17.1, 17.5 
17.6, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19, 19 
20, 20, 20, 21, 21, 21, 21.9, 22, 22, 23, 23, 23, 24, 24, 24, 25, 25, 26 
27, 27, 27, 28, 28, 29, 29, 30, 31, 31, 32, 32, 32, 32, 32, 32, 32.9, 34 
34, 35, 35, 35, 36, 36, 36, 37, 38, 38, 39, 39, 39, 40, 41, 43, 43, 44.1 
45, 45, 46, 46, 46, 46, 47, 47, 47, 47, 48, 49, 50, 53, 54, 55, 56, 56.5 
57, 58, 58, 60, 63, 67, 68, 69, 70, 71, 72, 72, 72, 73, 73, 74, 75, 75 
75, 75, 77, 78, 78, 78, 78, 79, 80, 81, 82, 83, 85, 85, 87, 87, 90, 90 
91, 91, 91, 92, 93, 93, 96, 9, 97, 99, 99, 99, 100, 100, 100, 101, 101 



102 
120 
131 
143 
155 
165 
177. 
192 
200 
223 
239 
267 
280 
299 
324 
353 
372 
392 
415 
450 



103, 103, 104, 104, 105, 106, 107, 109, 109, 113, 115, 117, 118 
121, 121, 121, 124, 124, 124, 124, 125, 126, 12, 128, 128, 131 
132, 133, 135, 135, 136, 136, 137, 137, 137, 138, 141, 142, 143 
143, 143, 144, 144, 145, 145, 148, 148, 149, 149, 150, 151, 153 
15, 156, 157, 157, 158, 160, 161, 162, 162, 162, 163, 164, 164 
167, 167, 169, 169, 170, 171, 171, 172, 173, 173, 173, 175, 176 
181, 182, 182, 183, 183, 18, 184, 185, 186, 188, 188, 190, 191 
192, 192, 193, 193.5, 194, 195, 196, 197, 198, 198, 200, 200 
202, 202, 203, 206, 207, 209, 210, 210, 213, 216, 216, 219, 221 
223, 226, 226, 226, 228, 229, 229, 230, 232, 233, 233, 234, 237 
242, 245, 247, 247, 250, 250, 252, 254, 256, 258, 263, 267, 267 
268, 269, 269, 269, 270, 271, 273, 273, 273, 275, 277, 277, 279 
282, 283, 284, 284, 284, 288, 289, 289, 295, 295, 298, 298, 298 
301, 305, 306, 308, 309, 311, 31, 313, 313, 316, 316, 320, 322 
327, 332, 333, 333, 339, 341, 342, 345, 346, 348, 349, 349, 352 
354, 354, 355, 357, 361, 362, 364, 365, 365, 365, 366, 36, 367 
372, 375, 376, 379, 380, 384, 384, 385, 388, 388, 388, 390, 391 
394, 395, 395, 396, 398, 399, 401, 401, 401, 404, 405, 410, 410 
416, 423, 42, 428, 430, 431, 431, 432, 440, 443, 443, 448, 448 
453, 456, 457, 461, 461, 464, 466, 471, 476, 478, 480, 496, 497 
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498, 502, 504, 504, 505, 511, 512, 513, 51, 515, 517, 518, 520, 524, 
526, 529, 533, 540, 542, 546, 550, 550, 552, 552, 553, 555, 555, 558, 
563, 565, 565, 567, 567, 571, 571, 576, 578, 580, 581, 583, 587, 58, 
591, 597, 597, 597, 598, 602, 611, 612, 614, 614, 616, 630, 635, 635, 
640, 640, 645, 650, 652, 652, 661, 662, 667, 681, 685, 695, 699, 719, 
720, 731, 732, 741, 75, 752, 760, 763, 764, 769, 771, 774, 776, 785, 
790, 792, 792, 804, 807, 814, 824, 843, 844, 849, 859, 874, 886, 892, 
892, 892, 923, 923, 932, 935, 938, 942, 952, 95, 955, 958, 975, 980, 
984, 989, 996, 999, 1002, 1004, 1013, 1040, 1044, 1054, 1059, 1064, 



1082, 1094, 1110, 1117, 1124, 1153, 1172, 1191, 1196, 1201, 1216, 

1219, 1226, 1261, 1262, 126, 1278, 1303, 1312, 1336, 1339, 1364, 

1371, 1372, 1379, 1380, 1387, 1401, 1403, 1405, 1431, 1431, 1439, 

1459, 1477, 1489, 1491, 1521, 1534, 1537, 1544, 1560, 176, 1593, 

1594, 1607, 1608, 1631, 1690, 1765, 1778, 1780, 1780, 1798, 1812, 

1815, 1899, 1917, 1986, 2081, 2152, 2200, 2205, 2288, 2335, 2440, 

2576, 2758, 2975, 3378, 3512, 4126, 4136, 4300, 4466, 5299, 5967, 

6700, 8596, 10336. 
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